#First we have to import our data
url <- "https://raw.githubusercontent.com/fivethirtyeight/data/master/candy-power-ranking/candy-data.csv"
candy <- read.csv(url, row.names = 1)
head(candy, n = 5)
## chocolate fruity caramel peanutyalmondy nougat crispedricewafer
## 100 Grand 1 0 1 0 0 1
## 3 Musketeers 1 0 0 0 1 0
## One dime 0 0 0 0 0 0
## One quarter 0 0 0 0 0 0
## Air Heads 0 1 0 0 0 0
## hard bar pluribus sugarpercent pricepercent winpercent
## 100 Grand 0 1 0 0.732 0.860 66.97173
## 3 Musketeers 0 1 0 0.604 0.511 67.60294
## One dime 0 0 0 0.011 0.116 32.26109
## One quarter 0 0 0 0.011 0.511 46.11650
## Air Heads 0 0 0 0.906 0.511 52.34146
Q1. How many different candy types are in this dataset?
nrow(candy)
## [1] 85
There are 85 different candies represented.
Q2. How many fruity candy types are in the dataset?
sum(candy$fruity)
## [1] 38
There are 38 fruity types of candy in this dataset.
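Summing works here because `fruity` is coded as 0/1, so the total equals the number of rows flagged 1. Here is a minimal sketch on a made-up data frame (`toy` and its values are illustrative assumptions, not the real candy data):

```r
# Hypothetical toy data, NOT the real candy dataset
toy <- data.frame(fruity = c(0, 1, 1, 0, 1),
                  row.names = c("A", "B", "C", "D", "E"))

n_fruity <- sum(toy$fruity)      # number of rows flagged 1
prop_fruity <- mean(toy$fruity)  # fraction of rows flagged 1

n_fruity
prop_fruity
```

The same idea gives the share of fruity candies in the real data via `mean(candy$fruity)`.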
Q3. What is your favorite candy in the dataset and what is its winpercent value?
candy["Twix", ]$winpercent
## [1] 81.64291
My favorite candy is Twix; its win percent is 81.64291.
Q4. What is the winpercent value for “Kit Kat”?
candy["Kit Kat", ]$winpercent
## [1] 76.7686
The win percent for Kit Kat is 76.7686.
Q5. What is the winpercent value for “Tootsie Roll Snack Bars”?
candy["Tootsie Roll Snack Bars", ]$winpercent
## [1] 49.6535
The win percent value for Tootsie Roll Snack Bars is 49.6535.
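The three lookups above all index rows by name. As a sketch, several candies can be looked up in one call by passing a character vector of row names. The `toy` data frame below is a hypothetical stand-in with rounded values, not the real dataset:

```r
# Hypothetical stand-in for the candy data frame (values rounded for illustration)
toy <- data.frame(winpercent = c(81.6, 76.8, 49.7),
                  row.names = c("Twix", "Kit Kat", "Tootsie Roll Snack Bars"))

wanted <- c("Kit Kat", "Tootsie Roll Snack Bars")

# Selecting a single column returns a plain numeric vector
result <- toy[wanted, "winpercent"]
names(result) <- wanted  # re-attach the candy names for readability
result
```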
library("skimr")
skim(candy)
| Name | candy |
| Number of rows | 85 |
| Number of columns | 12 |
| _______________________ | |
| Column type frequency: | |
| numeric | 12 |
| ________________________ | |
| Group variables | None |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| chocolate | 0 | 1 | 0.44 | 0.50 | 0.00 | 0.00 | 0.00 | 1.00 | 1.00 | ▇▁▁▁▆ |
| fruity | 0 | 1 | 0.45 | 0.50 | 0.00 | 0.00 | 0.00 | 1.00 | 1.00 | ▇▁▁▁▆ |
| caramel | 0 | 1 | 0.16 | 0.37 | 0.00 | 0.00 | 0.00 | 0.00 | 1.00 | ▇▁▁▁▂ |
| peanutyalmondy | 0 | 1 | 0.16 | 0.37 | 0.00 | 0.00 | 0.00 | 0.00 | 1.00 | ▇▁▁▁▂ |
| nougat | 0 | 1 | 0.08 | 0.28 | 0.00 | 0.00 | 0.00 | 0.00 | 1.00 | ▇▁▁▁▁ |
| crispedricewafer | 0 | 1 | 0.08 | 0.28 | 0.00 | 0.00 | 0.00 | 0.00 | 1.00 | ▇▁▁▁▁ |
| hard | 0 | 1 | 0.18 | 0.38 | 0.00 | 0.00 | 0.00 | 0.00 | 1.00 | ▇▁▁▁▂ |
| bar | 0 | 1 | 0.25 | 0.43 | 0.00 | 0.00 | 0.00 | 0.00 | 1.00 | ▇▁▁▁▂ |
| pluribus | 0 | 1 | 0.52 | 0.50 | 0.00 | 0.00 | 1.00 | 1.00 | 1.00 | ▇▁▁▁▇ |
| sugarpercent | 0 | 1 | 0.48 | 0.28 | 0.01 | 0.22 | 0.47 | 0.73 | 0.99 | ▇▇▇▇▆ |
| pricepercent | 0 | 1 | 0.47 | 0.29 | 0.01 | 0.26 | 0.47 | 0.65 | 0.98 | ▇▇▇▇▆ |
| winpercent | 0 | 1 | 50.32 | 14.71 | 22.45 | 39.14 | 47.83 | 59.86 | 84.18 | ▃▇▆▅▂ |
Q6. Is there any variable/column that looks to be on a different scale to the majority of the other columns in the dataset?
Yes, the winpercent variable is on a different scale. Most of the other columns are 0/1 indicators, and sugarpercent and pricepercent are proportions between 0 and 1, while winpercent is a percentage that ranges from 22.45 to 84.18 in this dataset.
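A difference in scale matters for methods such as PCA, where large-valued columns would otherwise dominate. A minimal sketch of standardizing columns with base R's `scale()`, using a made-up two-column data frame (`toy`, `flag`, and `pct` are illustrative assumptions):

```r
# Hypothetical data: one 0/1 column and one percentage-like column
toy <- data.frame(flag = c(0, 1, 0, 1),
                  pct  = c(20, 40, 60, 80))

# scale() centers each column to mean 0 and rescales it to standard deviation 1
scaled <- scale(toy)

colMeans(scaled)      # approximately 0 for every column
apply(scaled, 2, sd)  # 1 for every column
```

After scaling, both columns contribute on a comparable footing regardless of their original units.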
Q7. What do you think a zero and one represent for the candy$chocolate column?
A zero means that the candy is not a chocolate type, and a one means that it is. The column acts as a FALSE/TRUE indicator.
#Histograms
Q8. Plot a histogram of winpercent values
hist(candy$winpercent)
Q9. Is the distribution of winpercent values symmetrical?
The distribution is roughly but not perfectly symmetrical. The mean (50.32) is higher than the median (47.83), which suggests a slight right skew, with a longer tail toward high winpercent values.
Q10. Is the center of the distribution above or below 50%?
The center of the distribution is below 50%: the median is 47.83, although the mean (50.32) is just above it.
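The `skim()` summary reports a mean of 50.32 and a median (p50) of 47.83 for winpercent. Comparing the two is a rough heuristic for skew, not a formal test. A small sketch on a made-up vector (`x` is hypothetical):

```r
# Hypothetical values with one large observation pulling the mean upward
x <- c(30, 40, 45, 50, 85)

center_mean <- mean(x)
center_median <- median(x)

# A positive gap (mean above median) hints at a longer right tail
gap <- center_mean - center_median
gap
```

On the real data, the same comparison is `mean(candy$winpercent) - median(candy$winpercent)`.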
Q11. On average is chocolate candy higher or lower ranked than fruit candy?
Q12. Is this difference statistically significant?
#Convert the 0/1 chocolate and fruity columns to logical vectors and use them to select the matching winpercent values
chocolatemean <- candy$winpercent[as.logical(candy$chocolate)]
fruitmean <- candy$winpercent[as.logical(candy$fruity)]
#Then we use t.test() in order to see the significance of the results
t.test(chocolatemean, fruitmean)
##
## Welch Two Sample t-test
##
## data: chocolatemean and fruitmean
## t = 6.2582, df = 68.882, p-value = 2.871e-08
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## 11.44563 22.15795
## sample estimates:
## mean of x mean of y
## 60.92153 44.11974
Chocolate candy is ranked higher on average: its mean winpercent is 60.92, compared with 44.12 for fruity candy. The Welch t-test gives a p-value of 2.871e-08, which is well below 0.05, so the difference is statistically significant.
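The output above is labeled "Welch Two Sample t-test", which is what `t.test()` runs by default for two samples. As a sketch of where the t statistic comes from, it can be reproduced by hand on two made-up samples (`x` and `y` are hypothetical, not the candy groups):

```r
# Hypothetical samples, NOT the chocolate and fruity groups
x <- c(60, 65, 70, 75)
y <- c(40, 45, 50, 55, 60)

# Welch standard error: each group keeps its own variance
se <- sqrt(var(x) / length(x) + var(y) / length(y))
t_manual <- (mean(x) - mean(y)) / se

res <- t.test(x, y)  # default var.equal = FALSE gives the Welch test
c(manual = t_manual, t.test = unname(res$statistic))
```

The two numbers should agree, which confirms that the statistic is the difference in means divided by the unpooled standard error.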
#Overall Candy Rankings
Q13. What are the five least liked candy types in this set?
bottom5 <- head(candy[order(candy$winpercent),], n=5)
bottom5
## chocolate fruity caramel peanutyalmondy nougat
## Nik L Nip 0 1 0 0 0
## Boston Baked Beans 0 0 0 1 0
## Chiclets 0 1 0 0 0
## Super Bubble 0 1 0 0 0
## Jawbusters 0 1 0 0 0
## crispedricewafer hard bar pluribus sugarpercent pricepercent
## Nik L Nip 0 0 0 1 0.197 0.976
## Boston Baked Beans 0 0 0 1 0.313 0.511
## Chiclets 0 0 0 1 0.046 0.325
## Super Bubble 0 0 0 0 0.162 0.116
## Jawbusters 0 1 0 1 0.093 0.511
## winpercent
## Nik L Nip 22.44534
## Boston Baked Beans 23.41782
## Chiclets 24.52499
## Super Bubble 27.30386
## Jawbusters 28.12744
The five least liked candies in this dataset are Nik L Nip, Boston Baked Beans, Chiclets, Super Bubble, and Jawbusters. I prefer the base R approach for now because I am more comfortable with it; dplyr is still a bit too new for me.
Q14. What are the top 5 all time favorite candy types out of this set?
top5 <- head(candy[order(candy$winpercent, decreasing = TRUE),], n=5)
top5
## chocolate fruity caramel peanutyalmondy nougat
## Reese's Peanut Butter cup 1 0 0 1 0
## Reese's Miniatures 1 0 0 1 0
## Twix 1 0 1 0 0
## Kit Kat 1 0 0 0 0
## Snickers 1 0 1 1 1
## crispedricewafer hard bar pluribus sugarpercent
## Reese's Peanut Butter cup 0 0 0 0 0.720
## Reese's Miniatures 0 0 0 0 0.034
## Twix 1 0 1 0 0.546
## Kit Kat 1 0 1 0 0.313
## Snickers 0 0 1 0 0.546
## pricepercent winpercent
## Reese's Peanut Butter cup 0.651 84.18029
## Reese's Miniatures 0.279 81.86626
## Twix 0.906 81.64291
## Kit Kat 0.511 76.76860
## Snickers 0.651 76.67378
The top 5 favorite candies in this set are Reese’s Peanut Butter cup, Reese’s Miniatures, Twix, Kit Kat, and Snickers.
Q15. Make a first barplot of candy ranking based on winpercent values.
library(ggplot2)
#Plot winpercent for every candy, using the row names as bar labels
ggplot(candy) +
aes(winpercent, rownames(candy)) +
geom_col()
Let’s reorder
ggplot(candy) +
aes(winpercent, reorder(rownames(candy), winpercent)) +
geom_col()
Now let’s set up some colors for ourselves
my_cols <- rep("black", nrow(candy))
my_cols[as.logical(candy$chocolate)] <- "chocolate"
my_cols[as.logical(candy$bar)] <- "blue"
my_cols[as.logical(candy$fruity)] <- "pink"
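Note that the order of these assignments matters: a later assignment overwrites an earlier one, so a candy that is both chocolate and a bar ends up blue. A minimal sketch on a made-up data frame (`toy` and `cols` are hypothetical names):

```r
# Hypothetical rows: chocolate bar, plain chocolate, fruity, neither
toy <- data.frame(chocolate = c(1, 1, 0, 0),
                  bar       = c(1, 0, 0, 0),
                  fruity    = c(0, 0, 1, 0))

cols <- rep("black", nrow(toy))
cols[as.logical(toy$chocolate)] <- "chocolate"
cols[as.logical(toy$bar)] <- "blue"    # overwrites "chocolate" for row 1
cols[as.logical(toy$fruity)] <- "pink"

cols  # "blue" "chocolate" "pink" "black"
```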
Let’s try out the colors in our graph now
ggplot(candy) +
aes(winpercent, reorder(rownames(candy),winpercent)) +
geom_col(fill=my_cols)
Q17. What is the worst ranked chocolate candy?
The worst ranked chocolate candy is Sixlets.
Q18. What is the best ranked fruity candy?
The best ranked fruity candy is Starburst.
#Taking a look at pricepoint
#we need to load ggrepel before we can use it
library(ggrepel)
# How about a plot of price vs win
ggplot(candy) +
aes(winpercent, pricepercent, label=rownames(candy)) +
geom_point(col=my_cols) +
geom_text_repel(col=my_cols, size=3.3, max.overlaps = 5)
## Warning: ggrepel: 50 unlabeled data points (too many overlaps). Consider
## increasing max.overlaps
Q19. Which candy type is the highest ranked in terms of winpercent for the least money - i.e. offers the most bang for your buck?
Fruity candy seems to offer the most bang for your buck. Its win percent is not the highest, but the best fruity candies still score well (around 60%) and are cheap (pricepercent of roughly 0 to 0.30).
Q20. What are the top 5 most expensive candy types in the dataset and of these which is the least popular?
We can find the 5 most expensive candy types by sorting on pricepercent.
#First let's find the top 5 most expensive candies
ord <- order(candy$pricepercent, decreasing = TRUE)
head(candy[ord, c("pricepercent", "winpercent")], n = 5)
## pricepercent winpercent
## Nik L Nip 0.976 22.44534
## Nestle Smarties 0.976 37.88719
## Ring pop 0.965 35.29076
## Hershey's Krackel 0.918 62.28448
## Hershey's Milk Chocolate 0.918 56.49050
The top 5 most expensive candies are Nik L Nip, Nestle Smarties, Ring pop, Hershey’s Krackel, and Hershey’s Milk Chocolate. Of these, Nik L Nip is the least popular: it has the lowest win percent (22.45) in the table above, and on the price vs. win plot it is the leftmost of the five highest points.
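Rather than reading the least popular candy off the table, it can be picked out in code with `which.min()`. This is a sketch on a made-up data frame (`toy` and its values are hypothetical), taking the top 2 by price instead of the top 5 to keep it short:

```r
# Hypothetical prices and win percentages, NOT the real candy data
toy <- data.frame(pricepercent = c(0.9, 0.2, 0.8, 0.5),
                  winpercent   = c(30, 70, 60, 50),
                  row.names    = c("A", "B", "C", "D"))

# Most expensive rows first, then keep the top 2
top_n <- head(toy[order(toy$pricepercent, decreasing = TRUE), ], n = 2)

# Among those, the row with the smallest winpercent
least_popular <- rownames(top_n)[which.min(top_n$winpercent)]
least_popular
```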
Back to some other graphs
# Make a lollipop chart of pricepercent
ggplot(candy) +
aes(pricepercent, reorder(rownames(candy), pricepercent)) +
geom_segment(aes(yend = reorder(rownames(candy), pricepercent),
xend = 0), col="gray40") +
geom_point()
#Exploring the Correlation Structure
#first we load the package like usual
library(corrplot)
## corrplot 0.90 loaded
#let's run it!
cij <- cor(candy)
corrplot(cij)
Q22. Examining this plot what two variables are anti-correlated (i.e. have minus values)?
Fruity and chocolate are anti-correlated with each other.
Q23. Similarly, what two variables are most positively correlated?
Win percent and chocolate seem to be the most positively correlated.
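Reading the extremes off a correlation plot by eye can be double-checked in code. A minimal sketch on a made-up data frame (`toy` and its columns `a`, `b`, `c` are hypothetical), masking the diagonal so self-correlations of 1 are ignored:

```r
# Hypothetical data: b rises with a, c falls as a and b rise
toy <- data.frame(a = c(1, 2, 3, 4, 5),
                  b = c(2, 4, 6, 8, 11),
                  c = c(5, 4, 3, 2, 0))

cm <- cor(toy)
diag(cm) <- NA  # ignore each variable's correlation with itself

# Row/column positions of the largest and smallest remaining entries
pos <- which(cm == max(cm, na.rm = TRUE), arr.ind = TRUE)[1, ]
neg <- which(cm == min(cm, na.rm = TRUE), arr.ind = TRUE)[1, ]

most_pos <- colnames(cm)[pos]  # most positively correlated pair
most_neg <- colnames(cm)[neg]  # most negatively correlated pair
most_pos
most_neg
```

Applied to the real data, the same steps on `cij` would list the pairs behind the answers to Q22 and Q23.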
#Principal Component Analysis
pca <- prcomp(candy, scale. = TRUE)
summary(pca)
## Importance of components:
## PC1 PC2 PC3 PC4 PC5 PC6 PC7
## Standard deviation 2.0788 1.1378 1.1092 1.07533 0.9518 0.81923 0.81530
## Proportion of Variance 0.3601 0.1079 0.1025 0.09636 0.0755 0.05593 0.05539
## Cumulative Proportion 0.3601 0.4680 0.5705 0.66688 0.7424 0.79830 0.85369
## PC8 PC9 PC10 PC11 PC12
## Standard deviation 0.74530 0.67824 0.62349 0.43974 0.39760
## Proportion of Variance 0.04629 0.03833 0.03239 0.01611 0.01317
## Cumulative Proportion 0.89998 0.93832 0.97071 0.98683 1.00000
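The "Proportion of Variance" row is each component's squared standard deviation divided by the total. With 12 scaled variables the total variance is 12, so for PC1 this is 2.0788^2 / 12 ≈ 0.3601. Here is a sketch of the same calculation, using the built-in `USArrests` dataset as a stand-in (an assumption made so the example runs without downloading the candy data):

```r
# Built-in dataset used as a stand-in for the candy data
pca_toy <- prcomp(USArrests, scale. = TRUE)

# Variance of each component is its standard deviation squared
var_explained <- pca_toy$sdev^2 / sum(pca_toy$sdev^2)
cumulative <- cumsum(var_explained)

round(rbind(var_explained, cumulative), 4)
```

These two rows correspond to the "Proportion of Variance" and "Cumulative Proportion" rows printed by `summary()`.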
pca$rotation[,1]
## chocolate fruity caramel peanutyalmondy
## -0.4019466 0.3683883 -0.2299709 -0.2407155
## nougat crispedricewafer hard bar
## -0.2268102 -0.2215182 0.2111587 -0.3947433
## pluribus sugarpercent pricepercent winpercent
## 0.2600041 -0.1083088 -0.3207361 -0.3298035
#Now we plot it
plot(pca$x[, 1:2])
How about we add some color
plot(pca$x[,1:2], col=my_cols, pch=16)
Before we can use ggplot, we have to make a new data frame, then we can plot it
# Make a new data-frame with our PCA results and candy data
my_data <- cbind(candy, pca$x[,1:3])
#Now we plot it
p <- ggplot(my_data) +
aes(x=PC1, y=PC2,
size=winpercent/100,
text=rownames(my_data),
label=rownames(my_data)) +
geom_point(col=my_cols)
p
Let’s try adding labels!
library(ggrepel)
p + geom_text_repel(size=3.3, col=my_cols, max.overlaps = 7) +
theme(legend.position = "none") +
labs(title="Halloween Candy PCA Space",
subtitle="Colored by type: bar (blue), other chocolate (brown), fruity (pink), other (black)",
caption="Data from 538")
## Warning: ggrepel: 39 unlabeled data points (too many overlaps). Consider
## increasing max.overlaps
Let’s get plotly ready
library(plotly)
##
## Attaching package: 'plotly'
## The following object is masked from 'package:ggplot2':
##
## last_plot
## The following object is masked from 'package:stats':
##
## filter
## The following object is masked from 'package:graphics':
##
## layout
#try it out now
ggplotly(p)
If you hover over a point, it shows information about that data point!
par(mar=c(8,4,2,2))
barplot(pca$rotation[,1], las=2, ylab="PC1 Contribution")
Q24. What original variables are picked up strongly by PC1 in the positive direction? Do these make sense to you?
The fruity, hard, and pluribus variables are picked up strongly by PC1 in the positive direction. These groupings make sense given our correlation plot: fruity candy tended to be positively correlated with being hard and with being pluribus. Chocolate and fruity were strongly anti-correlated, so they point in opposite directions in the bar plot above.
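The positive loadings can also be extracted in code instead of being read off the bar plot. The sketch below hard-codes the `pca$rotation[,1]` values printed earlier so that it runs on its own (`pc1` and `positive` are names introduced here for illustration):

```r
# PC1 loadings copied from the pca$rotation[,1] output above
pc1 <- c(chocolate = -0.4019466, fruity = 0.3683883, caramel = -0.2299709,
         peanutyalmondy = -0.2407155, nougat = -0.2268102,
         crispedricewafer = -0.2215182, hard = 0.2111587, bar = -0.3947433,
         pluribus = 0.2600041, sugarpercent = -0.1083088,
         pricepercent = -0.3207361, winpercent = -0.3298035)

# Keep only the positive loadings, largest first
positive <- sort(pc1[pc1 > 0], decreasing = TRUE)
names(positive)  # "fruity" "pluribus" "hard"
```

Keep in mind that the sign of a principal component is arbitrary, so what matters is which variables sit on the same side, not which side is labeled positive.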